---
title: "Tutorial: Inference Evaluations"
sidebarTitle: "Tutorial"
description: Learn how to use the TensorZero Inference Evaluations to build principled LLM-powered applications.
---

This guide shows how to define and run inference evaluations for your TensorZero functions.

<Tip title="New to TensorZero?">

See our [Quickstart](/quickstart/) to learn how to set up our LLM gateway, observability, and fine-tuning &mdash; in just 5 minutes.

</Tip>

<Tip>

**You can find the code behind this tutorial and instructions on how to run it on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial).**

Reach out on [Slack](https://www.tensorzero.com/slack) or [Discord](https://www.tensorzero.com/discord) if you have any questions. We'd be happy to help!

</Tip>

## Status Quo

Imagine we have a TensorZero function for writing haikus about a given topic, and want to compare the behavior of GPT-4o and GPT-4o Mini on this task.

Initially, our configuration for this function might look like:

```toml
[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"

[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"
```

<Accordion title="User Schema & Template">

```json title="functions/write_haiku/user_schema.json"
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "topic": {
      "type": "string"
    }
  },
  "required": ["topic"],
  "additionalProperties": false
}
```

```text title="functions/write_haiku/user_template.minijinja"
Write a haiku about: {{ topic }}
```

</Accordion>

How can we evaluate the behavior of our two variants in a principled way?

One option is to build a dataset of "test cases" that we can evaluate them against.

## Datasets

To use TensorZero Evaluations, you first need to build a dataset.

A dataset is a collection of datapoints.
Each datapoint has an input and optionally a output.
In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you'd have liked to see.
You don't necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without a reference output (otherwise, that datapoint is skipped).

Let's create a dataset:

1. Generate many haikus by running inference on your `write_haiku` function. (On **[GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial)**, we provide a script `main.py` that generates 100 haikus with `write_haiku`.)
2. Open the UI, navigate to "Datasets", and select "Build Dataset" (`http://localhost:4000/datasets/builder`).
3. Create a new dataset called `haiku_dataset`.
   Select your `write_haiku` function, "None" as the metric, and "Inference" as the dataset output.

<Tip>

See the [Datasets & Datapoints API Reference](/gateway/api-reference/datasets-datapoints/) to learn how to create and manage datasets programmatically.

</Tip>

## Evaluations

Evalutions test the behavior of variants for a TensorZero function.

Let's define an evaluation in our configuration file:

```toml
[evaluations.haiku_eval]
type = "inference"
function_name = "write_haiku"
```

## Evaluators

Each evaluation has one or more evaluators: a rule or behavior you'd like to test.

Today, TensorZero supports two types of evaluators: `exact_match` and `llm_judge`.

<Tip title="Coming Soon">

We're planning to release other types of evaluators soon (e.g. semantic similarity in an embedding space).

</Tip>

### `exact_match`

The `exact_match` evaluator compares the generated output with the datapoint's reference output.
If they are identical, it returns true; otherwise, it returns false.

```toml
[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"
```

### `llm_judge`

LLM Judges are special-purpose TensorZero function that can be used to evaluate a TensorZero function.

For example, our haikus should generally follow a specific format, but it's hard to define a heuristic to determine if they're correct.
Why not ask an LLM?

Let's do that:

```toml
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"  # LLM judge should generate a boolean (or float)
optimize = "max"  # higher is better
cutoff = 0.95  # if the variant scores <95% = bad

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
```

<Accordion title="System Instructions">

```text title="evaluations/haiku_eval/valid_haiku/system_instructions.txt"
Evaluate if the text follows the haiku structure of exactly three lines with a 5-7-5 syllable pattern, totaling 17 syllables. Verify only this specific syllable structure of a haiku without making content assumptions.
```

</Accordion>

Here, we defined an evaluator `valid_haiku` of type `llm_judge`, with a variant that uses GPT-4o Mini.

Similar to regular TensorZero functions, we can define multiple variants for an LLM judge.
But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the `active` property.

<Accordion title="Example: Multiple Variants for an LLM Judge">

```toml
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"
optimize = "max"
cutoff = 0.95

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
active = true

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
```

</Accordion>

The LLM judge we showed above generates a boolean, but they can also generate floats.

Let's define another evalutor that counts the number of metaphors in our haiku.

```toml {3}
[evaluations.haiku_eval.evaluators.metaphor_count]
type = "llm_judge"
output_type = "float"  # LLM judge should generate a boolean (or float)
optimize = "max"
cutoff = 1  # <1 metaphor per haiku = bad
```

We can also use different variant types for evaluators.
Let's use a chain-of-thought variant for our metaphor count evaluator, since it's a bit more complex.

```toml {2}
[evaluations.haiku_eval.evaluators.metaphor_count.variants.gpt_4o_mini_judge]
type = "experimental_chain_of_thought"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/metaphor_count/system_instructions.txt"
json_mode = "strict"
```

<Accordion title="System Instructions">

```text title="evaluations/haiku_eval/metaphor_count/system_instructions.txt"
How many metaphors does the generated haiku have?
```

</Accordion>

The LLM judges we've defined so far only look at the datapoint's input and the generated output.
But we can also provide the datapoint's reference output to the judge:

```toml {3}
[evaluations.haiku_eval.evaluators.compare_haikus]
type = "llm_judge"
include = { reference_output = true }  # include the reference output in the LLM judge's context
output_type = "boolean"
optimize = "max"

[evaluations.haiku_eval.evaluators.compare_haikus.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/compare_haikus/system_instructions.txt"
json_mode = "strict"
```

<Accordion title="System Instructions">

```text title="evaluations/haiku_eval/compare_haikus/system_instructions.txt"
Does the generated haiku include the same figures of speech as the reference haiku?
```

</Accordion>

## Running an Evaluation

Let's run our evaluations!

You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.

<Tip>

The TensorZero Evaluations CLI tool can be helpful for CI/CD.
It'll exit with code 0 if all evaluations succeed (average score vs. `cutoff`), or code 1 otherwise.

</Tip>

<Tip>

By default, TensorZero Evaluations uses [Inference Caching](/gateway/guides/inference-caching/) to improve inference speed and cost.

</Tip>

### CLI

To run evaluations in the CLI, you can use the `tensorzero/evaluations` container:

```bash
docker compose run --rm evaluations \
    --evaluation-name haiku_eval \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5
```

<Accordion title="Docker Compose">

Here's the relevant section of the `docker-compose.yml` for the evaluations tool.

You should provide credentials for any LLM judges.
Alternatively, the evaluations tool can use an external TensorZero Gateway with the `--gateway-url http://gateway:3000` flag.

```yaml
services:
  # ...

  evaluations:
    profiles: [evaluations]
    image: tensorzero/evaluations
    volumes:
      - ./config:/app/config:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
      # ... and any other relevant API credentials ...
      - TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      clickhouse:
        condition: service_healthy
# ...
```

See [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial) for the complete Docker Compose configuration.

<Tip>

Docker Compose does _not_ start this service with `docker compose up` since we have `profiles: [evaluations]`.
You need to call it explicitly with `docker compose run evaluations`, as desired.

</Tip>

</Accordion>

### UI

To run evaluations in the UI, navigate to "Evaluations" (`http://localhost:4000/evaluations`) and select "New Run".

You can compare multiple evaluation runs in the TensorZero UI (including evaluation runs for the CLI).

![TensorZero Evaluation UI](./configuration-reference-evaluation-ui.png)
